[core] Introduce BucketSelector based on partition values to achieve bucket level predicate push down#7486
Conversation
Force-pushed from 686a764 to 858c506.
Pull request overview
Introduces a partition-aware BucketSelector to enable bucket-level predicate pushdown for compound predicates by evaluating partition predicates against concrete partition values during scan planning.
Changes:
- Add BucketSelector + PartitionValuePredicateVisitor and wire full-predicate propagation via withCompleteFilter to enable partition-aware bucket pruning.
- Update bucket filtering plumbing to use TriFilter<BinaryRow, Integer, Integer> (partition, bucket, totalBucket) end-to-end.
- Add new unit/integration tests covering compound-predicate bucket pruning (single-field and composite bucket keys).
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| paimon-core/src/main/java/org/apache/paimon/operation/BucketSelector.java | New partition-aware bucket selector that derives candidate buckets from predicates. |
| paimon-core/src/main/java/org/apache/paimon/operation/BucketSelectConverter.java | Refactored to return a BucketSelector (TriFilter) when eligible. |
| paimon-core/src/main/java/org/apache/paimon/table/source/snapshot/SnapshotReaderImpl.java | Passes the full predicate to scans via withCompleteFilter. |
| paimon-core/src/main/java/org/apache/paimon/operation/FileStoreScan.java | Adds withCompleteFilter and switches total-aware bucket filter to TriFilter. |
| paimon-core/src/main/java/org/apache/paimon/operation/AppendOnlyFileStoreScan.java | Implements withCompleteFilter to install bucket-level pruning. |
| paimon-core/src/main/java/org/apache/paimon/operation/KeyValueFileStoreScan.java | Implements withCompleteFilter to install bucket-level pruning. |
| paimon-core/src/main/java/org/apache/paimon/operation/AbstractFileStoreScan.java | Threads partition value into bucket filtering during manifest entry filtering. |
| paimon-core/src/main/java/org/apache/paimon/manifest/BucketFilter.java | Bucket filter now tests with (partition, bucket, totalBucket) via TriFilter. |
| paimon-core/src/main/java/org/apache/paimon/manifest/ManifestEntryCache.java | Applies bucket filtering with partition context when scanning cached segments. |
| paimon-core/src/main/java/org/apache/paimon/AppendOnlyFileStore.java | Constructs the new BucketSelectConverter instance. |
| paimon-core/src/main/java/org/apache/paimon/KeyValueFileStore.java | Constructs the new BucketSelectConverter instance. |
| paimon-common/src/main/java/org/apache/paimon/utils/TriFilter.java | New 3-arg filter functional interface used for bucket pruning. |
| paimon-common/src/main/java/org/apache/paimon/predicate/PartitionValuePredicateVisitor.java | New visitor that evaluates partition-only leaf predicates against a concrete partition row. |
| paimon-common/src/main/java/org/apache/paimon/predicate/PredicateReplaceVisitor.java | Uses PredicateBuilder.and/or to simplify rebuilt compound predicates. |
| paimon-core/src/test/java/org/apache/paimon/table/BucketFilterScanTest.java | New integration test validating bucket pruning under compound predicates (single/composite keys). |
| paimon-core/src/test/java/org/apache/paimon/operation/BucketSelectorTest.java | New unit tests for bucket selection behavior across predicate patterns and partitioned tables. |
| paimon-common/src/test/java/org/apache/paimon/predicate/PartitionValuePredicateVisitorTest.java | New unit tests validating predicate rewriting for partition values. |
| paimon-core/src/test/java/org/apache/paimon/operation/BucketSelectConverterTest.java | Removed (replaced by BucketSelectorTest). |
| paimon-core/src/test/java/org/apache/paimon/manifest/BucketFilterTest.java | Removed (behavior now exercised via updated scan/bucket selector tests). |
```java
int rowCount = 1;
for (List<Object> values : bucketValues) {
    if (values == null) {
        return Optional.empty();
    }

    rowCount *= values.size();
    if (rowCount > MAX_VALUES) {
        return Optional.empty();
    }
```
In BucketSelector, rowCount *= values.size() uses int multiplication and can overflow into a negative value (e.g., two large IN-lists on a composite bucket key). That can bypass the rowCount > MAX_VALUES guard and then assembleRows may attempt to enumerate an enormous Cartesian product, risking hangs/OOM. Consider switching rowCount to long (or using checked multiplication) and short-circuiting as soon as the product exceeds MAX_VALUES (ideally also cap/dedup per-column value collection early to avoid holding huge literal lists).
Suggested change:

```diff
-        int rowCount = 1;
-        for (List<Object> values : bucketValues) {
-            if (values == null) {
-                return Optional.empty();
-            }
-            rowCount *= values.size();
-            if (rowCount > MAX_VALUES) {
-                return Optional.empty();
-            }
+        long rowCount = 1L;
+        for (List<Object> values : bucketValues) {
+            if (values == null) {
+                return Optional.empty();
+            }
+            int size = values.size();
+            if (size <= 0) {
+                // No possible values for this column, no matching rows
+                return Optional.empty();
+            }
+            // Check for overflow and limit total number of combinations
+            if (rowCount > (long) MAX_VALUES / size) {
+                return Optional.empty();
+            }
+            rowCount *= size;
```
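The divide-before-multiply guard in the suggestion can be demonstrated in isolation. Below is a standalone sketch, not the actual Paimon code: the class and method names and the `MAX_VALUES` cap are illustrative, but the guard mirrors the suggested change and can never overflow, because for positive integers `rowCount > MAX_VALUES / size` (integer division) is equivalent to `rowCount * size > MAX_VALUES`.

```java
import java.util.List;
import java.util.OptionalLong;

// Hedged sketch (not the actual BucketSelector code): counting the
// Cartesian-product size of per-column value lists with a checked guard.
public class CartesianCountSketch {
    // Illustrative cap, standing in for BucketSelector's MAX_VALUES.
    static final int MAX_VALUES = 1000;

    /** Returns the number of combinations, or empty if unknown or over the cap. */
    public static OptionalLong countCombinations(List<List<Object>> bucketValues) {
        long rowCount = 1L;
        for (List<Object> values : bucketValues) {
            if (values == null) {
                return OptionalLong.empty(); // unknown values for this column: give up
            }
            int size = values.size();
            if (size <= 0) {
                return OptionalLong.empty(); // no possible value: no matching rows
            }
            // Dividing first avoids overflow: for positive numbers,
            // rowCount > MAX_VALUES / size  <=>  rowCount * size > MAX_VALUES.
            if (rowCount > (long) MAX_VALUES / size) {
                return OptionalLong.empty();
            }
            rowCount *= size;
        }
        return OptionalLong.of(rowCount);
    }
}
```

With the original `int` multiplication, two IN-lists of 100 000 values each would wrap `rowCount` to a value that slips past the `> MAX_VALUES` check; the division guard rejects that input on the first iteration.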
```java
    builder.option(BUCKET_KEY.key(), "b");
}
Schema schema = builder.build();

Identifier tableId = identifier("test_bucket_filter");
catalog.createTable(tableId, schema, false);
Table table = catalog.getTable(tableId);

// ---- write data: 5 partitions × 20 b-values = 100 rows ----
GenericRow[] rows = new GenericRow[100];
int idx = 0;
```
These assertions hard-code specific bucket IDs (e.g., "3,1", "1,6"). That makes the test brittle to changes in the bucket hash implementation / BucketFunctionType defaults, even if bucket-level pruning is still correct. Consider computing expected bucket IDs using the same BucketFunction as production (and asserting on those), so the test validates pruning behavior without depending on a particular hash result.
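The reviewer's suggestion can be sketched generically. The helper below is hypothetical: `hashBucketKey` merely stands in for whatever `BucketFunction` the production table actually configures (it is not Paimon's real hash). The point is only that a test deriving expected bucket IDs from the same function as production stays valid if the hash implementation changes.

```java
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of deriving expected bucket IDs from a shared bucket
// function instead of hard-coding results like "3,1" in assertions.
public class ExpectedBuckets {
    // Stand-in for the production bucket function (an assumption, not Paimon's hash).
    static int hashBucketKey(Object... bucketKey) {
        return Objects.hash(bucketKey);
    }

    /** Bucket a single key maps to; always in [0, numBuckets). */
    public static int expectedBucket(int numBuckets, Object... bucketKey) {
        return Math.floorMod(hashBucketKey(bucketKey), numBuckets);
    }

    /** Distinct buckets that a set of candidate key values can land in. */
    public static Set<Integer> expectedBuckets(int numBuckets, Object[][] candidateKeys) {
        Set<Integer> buckets = new TreeSet<>();
        for (Object[] key : candidateKeys) {
            buckets.add(expectedBucket(numBuckets, key));
        }
        return buckets;
    }
}
```

A test would then assert that the scan reads exactly `expectedBuckets(totalBuckets, literalsFromPredicate)`, rather than a literal bucket-ID string.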
```java
package org.apache.paimon.utils;

/** Represents a filter (boolean-valued function) of three argument. */
```
Typo/grammar in the Javadoc: "filter ... of three argument" should be "... of three arguments".
Suggested change:

```diff
-/** Represents a filter (boolean-valued function) of three argument. */
+/** Represents a filter (boolean-valued function) of three arguments. */
```
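For reference, a three-argument filter interface along these lines is a minimal sketch; the PR's actual `TriFilter` may differ in package, default methods, or generics:

```java
/**
 * Represents a filter (boolean-valued function) of three arguments,
 * e.g. (partition, bucket, totalBucket) for bucket pruning.
 */
@FunctionalInterface
public interface TriFilter<T, U, V> {
    /** Returns true if the (t, u, v) triple passes the filter. */
    boolean test(T t, U u, V v);
}
```

A bucket filter is then just a lambda, e.g. `(partition, bucket, totalBucket) -> selectedBuckets.contains(bucket)`.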
```java
}
Schema schema = builder.build();

Identifier tableId = identifier("test_composite_bucket_filter");
catalog.createTable(tableId, schema, false);
Table table = catalog.getTable(tableId);

// ---- write data: 5 partitions × 20 b-values × 10 c-values = 1000 rows ----
GenericRow[] rows = new GenericRow[1000];
int idx = 0;
for (int a = 1; a <= 5; a++) {
    for (int b = 1; b <= 20; b++) {
        for (int c = 0; c < 10; c++) {
```
Same brittleness here: the expected results are hard-coded bucket IDs for composite keys (e.g., "3,9", "3,0"...). To keep the test stable across bucket hash / BucketFunction changes, consider deriving expected bucket IDs via BucketFunction from the (b,c) literals instead of asserting specific numeric buckets.
|
+1
Purpose
Introduce a BucketSelector based on partition values to achieve bucket-level predicate pushdown.
Case 1: bucket filtering with compound predicates on a single-field bucket key.
Table schema:
Data distribution: 5 partitions (a=1 to 5) × 20 b-values (b=1 to 20) = 100 rows.
Scenarios:
Case 2: bucket filtering with compound predicates on a composite (multi-field) bucket key.
Table schema:
Data distribution: 5 partitions (a=1 to 5) × 20 b-values (b=1 to 20) × 10 c-values (c=0 to 9) = 1000 rows.
Test scenarios:
Tests
API and Format
Documentation
Generative AI tooling